Revisiting k-means: New Algorithms via Bayesian Nonparametrics
نویسندگان
چکیده
Bayesian models offer great flexibility for clustering applications—Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between kmeans and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.
منابع مشابه
Multiagent Planning with Bayesian Nonparametric Asymptotics
Autonomous multiagent systems are beginning to see use in complex, changing environments that cannot be completely specified a priori. In order to be adaptive to these environments and avoid the fragility associated with making too many a priori assumptions, autonomous systems must incorporate some form of learning. However, learning techniques themselves often require structural assumptions to...
متن کاملExploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics
Exploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics by Yan Shang Department of Statistical Science Duke University
متن کاملNon-parametric Power-law Data Clustering
It has always been a great challenge for clustering algorithms to automatically determine the cluster numbers according to the distribution of datasets. Several approaches have been proposed to address this issue, including the recent promising work which incorporate Bayesian Nonparametrics into the k-means clustering procedure. This approach shows simplicity in implementation and solidity in t...
متن کامل708 , Spring 2014 19 : Bayesian Nonparametrics : Dirichlet Processes
In parametric modeling, it is assumed that data can be represented by models using a fixed, finite number of parameters. Examples of parametric models include clusters of K Gaussians and polynomial regression models. In many problems, determining the number of parameters a priori is difficult; for example, selecting the number of clusters in a cluster model, the number of segments in an image s...
متن کاملDistributional properties of means of random probability measures
The present paper provides a review of the results concerning distributional properties of means of random probability measures. Our interest in this topic has originated from inferential problems in Bayesian Nonparametrics. Nonetheless, it is worth noting that these random quantities play an important role in seemingly unrelated areas of research. In fact, there is a wealth of contributions bo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1111.0352 شماره
صفحات -
تاریخ انتشار 2012